AITopics | data contributor

Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

Wang, Haohui, Qi, Jingyuan, Chen, Jianpeng, Wu, Jun, Huang, Lifu, Zheng, Lecheng, Choi, Kevin, Veeramani, Balaji, Bowen, Edward, Hu, Alison, Cody, Tyler, Zhou, Dawei

arXiv.org Artificial Intelligence | Nov-18-2025

The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.
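The truncation effect named in the abstract can be illustrated with a minimal Python sketch of top-p (nucleus) filtering on a toy distribution. This is an illustration under stated assumptions, not the authors' method: the function name `top_p_truncate` and the four-item distribution are hypothetical.

```python
def top_p_truncate(probs, p):
    """Keep the smallest set of highest-probability items whose total
    mass reaches p, zero out the rest, and renormalize what is kept."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Visit item indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    truncated = [0.0] * len(probs)
    for i in kept:
        truncated[i] = probs[i] / mass
    return truncated


# Hypothetical toy distribution: index 3 plays the role of long-tail knowledge.
toy = [0.5, 0.3, 0.15, 0.05]
print(top_p_truncate(toy, 0.9))
```

With p = 0.9 the rarest item is assigned zero probability, so no amount of sampling from the truncated distribution will reproduce it in the synthetic data.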

artificial intelligence, large language model, natural language, (14 more...)

2511.1364

Country:

Europe (0.93)
North America > United States > California (0.28)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Congratulations to the #AIES2025 best paper award winners!

AIHub | Oct-21-2025, 11:54:04 GMT

The eighth AAAI / ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) is currently taking place in Madrid, Spain, running from 20-22 October. During the opening ceremony, the best papers for this year were announced. While it is well-known that AI systems might bring about unfair social impacts by influencing social schemas, much attention has been paid to instances where the content presented by AI systems explicitly demeans marginalized groups or reinforces problematic stereotypes. This paper urges critical scrutiny to be paid to instances that shape social schemas through subtler manners. Drawing from recent philosophical discussions on the politics of artifacts, we argue that many existing AI systems should be identified as what Liao and Huebner called oppressive things when they function to manifest oppressive normality.

ai system, aies2025 best paper award winner, congratulation, (11 more...)

Country:

Europe > Spain > Madrid (0.26)
Asia > China (0.06)
North America > United States (0.05)

Genre: Personal > Honors > Award (0.41)

Technology:

Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.52)
Information Technology > Communications > Social Media (0.50)
Information Technology > Artificial Intelligence > Natural Language (0.49)

RecPS: Privacy Risk Scoring for Recommender Systems

He, Jiajie, Gu, Yuechun, Chen, Keke

arXiv.org Artificial Intelligence | Sep-9-2025

Recommender systems (RecSys) have become an essential component of many web applications. The core of the system is a recommendation model trained on highly sensitive user-item interaction data. While privacy-enhancing techniques are actively studied in the research community, real-world model development still depends on minimal privacy protection, e.g., via controlled access. Users of such systems should have the right to choose not to share highly sensitive interactions. However, there is no method allowing the user to know which interactions are more sensitive than others. Thus, quantifying the privacy risk of RecSys training data is a critical step to enabling privacy-aware RecSys model development and deployment. We propose a membership-inference attack (MIA)-based privacy scoring method, RecPS, to measure privacy risks at both the interaction and user levels. The RecPS interaction-level score definition is motivated and derived from differential privacy, which is then extended to the user-level scoring method. A critical component is the interaction-level MIA method RecLiRA, which gives high-quality membership estimation. We have conducted extensive experiments on well-known benchmark datasets and RecSys models to show the unique features and benefits of RecPS scoring in risk assessment and RecSys model unlearning.

artificial intelligence, interaction, machine learning, (16 more...)

doi: 10.1145/3705328.3748052

2507.18365

Country: North America > United States > Maryland (0.68)

Genre: Research Report > New Finding (0.46)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Secondary Stakeholders in AI: Fighting for, Brokering, and Navigating Agency

Ajmani, Leah Hope, Abdelkadir, Nuredin Ali, Chancellor, Stevie

arXiv.org Artificial Intelligence | Jun-10-2025

As AI technologies become more human-facing, there have been numerous calls to adapt participatory approaches to AI development, spurring the idea of participatory AI. However, these calls often focus only on primary stakeholders, such as end-users, and not secondary stakeholders. This paper seeks to translate the ideals of participatory AI to a broader population of secondary AI stakeholders through semi-structured interviews. We theorize that meaningful participation involves three participatory ideals: (1) informedness, (2) consent, and (3) agency. We also explore how secondary stakeholders realize these ideals by traversing a complicated problem space. Like walking up the rungs of a ladder, these ideals build on one another. We introduce three stakeholder archetypes: the reluctant data contributor, the unsupported activist, and the well-intentioned practitioner, who must navigate systemic barriers to achieving agentic AI relationships. We envision an AI future where secondary stakeholders are able to meaningfully participate with the AI systems they influence and are influenced by.

artificial intelligence, machine learning, natural language, (18 more...)

doi: 10.1145/3715275.3732071

2506.07281

Country:

Europe > United Kingdom > England (0.46)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.93)

Industry:

Health & Medicine (0.94)
Social Sector (0.93)
Law (0.93)
(2 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.67)

Position: The Most Expensive Part of an LLM should be its Training Data

Kandpal, Nikhil, Raffel, Colin

arXiv.org Artificial Intelligence | Apr-18-2025

Training a state-of-the-art Large Language Model (LLM) is an increasingly expensive endeavor due to growing computational, hardware, energy, and engineering demands. Yet, an often-overlooked (and seldom paid) expense is the human labor behind these models' training data. Every LLM is built on an unfathomable amount of human effort: trillions of carefully written words sourced from books, academic papers, codebases, social media, and more. This position paper aims to assign a monetary value to this labor and argues that the most expensive part of producing an LLM should be the compensation provided to training data producers for their work. To support this position, we study 64 LLMs released between 2016 and 2024, estimating what it would cost to pay people to produce their training datasets from scratch. Even under highly conservative estimates of wage rates, the costs of these models' training datasets are 10-1000 times larger than the costs to train the models themselves, representing a significant financial liability for LLM providers. In the face of the massive gap between the value of training data and the lack of compensation for its creation, we highlight and discuss research directions that could enable fairer practices in the future.

large language model, machine learning, natural language, (18 more...)

2504.12427

Genre: Research Report (0.53)

Industry:

Law (1.00)
Media > News (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

FT-PrivacyScore: Personalized Privacy Scoring Service for Machine Learning Participation

Gu, Yuechun, He, Jiajie, Chen, Keke

arXiv.org Artificial Intelligence | Oct-29-2024

Training data privacy has been a top concern in AI modeling. While methods like differentially private learning allow data contributors to quantify acceptable privacy loss, model utility is often significantly damaged. In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments. In controlled data access, authorized model builders work in a restricted environment to access sensitive data, which can fully preserve data utility with reduced risk of data leakage. However, unlike differential privacy, there is no quantitative measure for individual data contributors to tell their privacy risk before participating in a machine learning task. We developed the demo prototype FT-PrivacyScore to show that it's possible to efficiently and quantitatively estimate the privacy risk of participating in a model fine-tuning task. The demo source code will be available at https://github.com/RhincodonE/demo_privacy_scoring.

artificial intelligence, machine learning, privacy, (13 more...)

2410.22651

Country: North America > United States > Utah > Salt Lake County > Salt Lake City (0.05)

Genre: Research Report (0.83)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

1 Trillion Token (1TT) Platform: A Novel Framework for Efficient Data Sharing and Compensation in Large Language Models

Park, Chanjun, Ha, Hyunsoo, Kim, Jihoo, Kim, Yungi, Kim, Dahyun, Lee, Sukyung, Yang, Seonghoon

arXiv.org Artificial Intelligence | Sep-30-2024

In this paper, we propose the 1 Trillion Token Platform (1TT Platform), a novel framework designed to facilitate efficient data sharing with a transparent and equitable profit-sharing mechanism. The platform fosters collaboration between data contributors, who provide otherwise non-disclosed datasets, and a data consumer, who utilizes these datasets to enhance their own services. Data contributors are compensated in monetary terms, receiving a share of the revenue generated by the services of the data consumer. The data consumer is committed to sharing a portion of the revenue with contributors, according to predefined profit-sharing arrangements. By incorporating a transparent profit-sharing paradigm to incentivize large-scale data sharing, the 1TT Platform creates a collaborative environment to drive the advancement of NLP and LLM technologies.
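The abstract does not say how the "predefined profit-sharing arrangements" are computed. As one possible reading, the Python sketch below splits a fixed fraction of the data consumer's revenue among contributors in proportion to the tokens each supplied. The pro-rata rule, the function name `split_revenue`, the parameter `pool_fraction`, and all figures are assumptions made for illustration, not details of the 1TT Platform.

```python
def split_revenue(revenue, pool_fraction, tokens_by_contributor):
    """Share `pool_fraction` of `revenue` among contributors,
    pro rata by the number of tokens each one contributed."""
    if not 0.0 <= pool_fraction <= 1.0:
        raise ValueError("pool_fraction must be in [0, 1]")
    total_tokens = sum(tokens_by_contributor.values())
    if total_tokens <= 0:
        raise ValueError("at least one token must be contributed")
    pool = revenue * pool_fraction  # the part of revenue set aside for contributors
    return {
        name: pool * tokens / total_tokens
        for name, tokens in tokens_by_contributor.items()
    }


# Hypothetical figures: 20% of revenue is shared between two contributors.
payouts = split_revenue(1000.0, 0.2, {"publisher_a": 300, "publisher_b": 100})
print(payouts)
```

A pro-rata rule is simple to audit, which fits the transparency goal stated in the abstract, but it ignores data quality; a real arrangement could weight contributions differently.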

arxiv preprint arxiv, contributor, data contributor, (10 more...)

2409.20149

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

AccessShare: Co-designing Data Access and Sharing with Blind People

Kamikubo, Rie, Zeraati, Farnaz Zamiri, Lee, Kyungjun, Kacorri, Hernisa

arXiv.org Artificial Intelligence | Jul-27-2024

Blind people are often called to contribute image data to datasets for AI innovation with the hope for future accessibility and inclusion. Yet, the visual inspection of the contributed images is inaccessible. To this day, we lack mechanisms for data inspection and control that are accessible to the blind community. To address this gap, we engage 10 blind participants in a scenario where they wear smartglasses and collect image data using an AI-infused application in their homes. We also engineer a design probe, a novel data access interface called AccessShare, and conduct a co-design study to discuss participants' needs, preferences, and ideas on consent, data inspection, and control. Our findings reveal the impact of interactive informed consent and the complementary role of data inspection systems such as AccessShare in facilitating communication between data stewards and blind data contributors. We discuss how key insights can guide future informed consent and data control to promote inclusive and responsible data practices in AI.

accessshare, descriptor, participant, (14 more...)

doi: 10.1145/3663548.3675612

2407.19351

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
North America > United States > New York > New York County > New York City (0.06)
North America > Canada > Newfoundland and Labrador > Newfoundland > St. John's (0.05)
(6 more...)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.48)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.93)
Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Social Media (1.00)
(5 more...)

Mendata: A Framework to Purify Manipulated Training Data

Huang, Zonghao, Gong, Neil, Reiter, Michael K.

arXiv.org Artificial Intelligence | Dec-2-2023

Untrusted data used to train a model might have been manipulated to endow the learned model with hidden properties that the data contributor might later exploit. Data purification aims to remove such manipulations prior to training the model. We propose Mendata, a novel framework to purify manipulated training data. Starting from a small reference dataset in which a large majority of the inputs are clean, Mendata perturbs the training inputs so that they retain their utility but are distributed similarly (as measured by Wasserstein distance) to the reference data, thereby eliminating hidden properties from the learned model. A key challenge is how to find such perturbations, which we address by formulating a min-max optimization problem and developing a two-step method to iteratively solve it. We demonstrate the effectiveness of Mendata by applying it to defeat state-of-the-art data poisoning and data tracing techniques.
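For readers unfamiliar with the Wasserstein distance mentioned in the abstract, the Python sketch below computes it in the simplest possible setting: two one-dimensional samples of equal size, where the Wasserstein-1 distance equals the mean absolute difference between the sorted values. This toy case is an assumption chosen for clarity; how Mendata estimates the distance on real training inputs is not described in the abstract, and the name `wasserstein_1d` is hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size empirical distributions on the real line, the optimal
    transport plan matches the i-th smallest x to the i-th smallest y.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Shifting every point by 1 moves the whole distribution by distance 1.
print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

The distance depends only on the two distributions, not on the order in which the samples are listed, which is why it can serve as a target for making perturbed inputs "distributed similarly" to reference data.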

dataset, detector, mendata, (15 more...)

2312.01281

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Nepal (0.04)

Genre: Research Report > New Finding (0.94)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Proof-of-Federated-Learning-Subchain: Free Partner Selection Subchain Based on Federated Learning

Li, Boyang, Shen, Bingyu, Lu, Qing, Jung, Taeho, Shi, Yiyu

arXiv.org Artificial Intelligence | Jul-30-2023

The continued growth of the blockchain community motivates research into novel designs of schemes supporting cryptocurrencies. Previously, multiple Proof-of-Deep-Learning (PoDL) consensuses have been proposed to replace hashing with useful work such as deep learning model training tasks, so that energy is used more efficiently while maintaining the ledger. However, deep learning models are problem-specific and can be extremely complex, and current PoDL consensuses still require much work to realize in the real world. In this paper, we propose a novel consensus named Proof-of-Federated-Learning-Subchain (PoFLSC) to fill the gap. We apply a subchain to record the training, challenging, and auditing activities and emphasize the importance of valuable datasets in partner selection. We simulated 20 miners in the subchain to demonstrate the effectiveness of PoFLSC. When we reduced the pool size according to the reservation priority order, the difference in performance drop rates across scenarios further showed that a miner with a higher Shapley Value (SV) has a better chance of being selected when the size of the subchain pool is limited. In the conducted experiments, the PoFLSC consensus helped the subchain manager stay aware of the reservation priority and the core partition of contributors in order to establish and maintain a competitive subchain.
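The Shapley Value that the abstract uses to rank miners can be computed exactly for a small group by averaging each player's marginal contribution over every possible joining order. The Python sketch below does this by brute force; it is a generic illustration, not the PoFLSC procedure, and the miner names, dataset sizes, and additive value function are hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating all joining orders.

    `value` maps a frozenset of players to the worth of that coalition.
    The cost grows as n!, so this is only practical for a few players.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            joined = coalition | {p}
            # Marginal contribution of p when it joins this coalition.
            totals[p] += value(joined) - value(coalition)
            coalition = joined
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical example: a coalition is worth the sum of its members' dataset sizes.
sizes = {"miner_1": 10, "miner_2": 30, "miner_3": 60}
print(shapley_values(list(sizes), lambda s: sum(sizes[m] for m in s)))
```

With an additive value function each miner's Shapley Value is simply its own dataset size; the measure becomes informative when datasets overlap or complement each other, so that a miner's contribution depends on who else is in the pool.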

artificial intelligence, deep learning, machine learning, (19 more...)

2307.16342

Country:

North America > United States > Massachusetts (0.04)
North America > United States > Indiana > St. Joseph County > Notre Dame (0.04)

Genre: Research Report (0.40)

Industry:

Information Technology (0.46)
Materials > Metals & Mining (0.37)
Banking & Finance > Trading (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
